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Abstract — Cyclops  (cyclic-operations)  Tensor  Framework 
(CTF)  1  is  a  distributed  library  for  tensor  contractions.  CTF 
aims  to  scale  high-dimensional  tensor  contractions  such  as  those 
required  in  the  Coupled  Cluster  (CC)  electronic  structure  method 
to  massively-parallel  supercomputers.  The  framework  preserves 
tensor  structure  by  subdividing  tensors  cyclically,  producing  a 
regular  parallel  decomposition.  An  internal  virtualization  layer 
provides  completely  general  mapping  support  while  maintaining 
ideal  load  balance.  The  mapping  framework  decides  on  the  best 
mapping  for  each  tensor  contraction  at  run-time  via  explicit 
calculations  of  memory  usage  and  communication  volume.  CTF 
employs  a  general  redistribution  kernel,  which  transposes  tensors 
of  any  dimension  between  arbitrary  distributed  layouts,  yet 
touches  each  piece  of  data  only  once.  Sequential  symmetric 
contractions  are  reduced  to  matrix  multiplication  calls  via  ten¬ 
sor  index  transpositions  and  partial  unpacking.  The  user-level 
interface  elegantly  expresses  arbitrary-dimensional  generalized 
tensor  contractions  in  the  form  of  a  domain  specific  language. 
We  demonstrate  performance  of  CC  with  single  and  double 
excitations  on  8192  nodes  of  Blue  Gene/Q  and  show  that 
CTF  outperforms  NWChem  on  Cray  XE6  supercomputers  for 
benchmarked  systems. 

I.  Introduction 

Quantum  chemistry  is  the  field  of  science  focused  on  the 
application  of  quantum  mechanics  to  the  study  of  chemical 
problems.  While  far  from  the  only  tool  used  to  study  such 
problems,  quantum  chemistry  plays  a  role  in  elucidating  the 
design  of  new  materials  for  energy  capture  and  storage,  the 
mechanisms  of  combustion  and  atmospheric  processes,  and  the 
interaction  of  molecules  with  many  kinds  of  radiation,  which  is 
fundamental  to  probing  many  forms  of  matter  at  the  atomistic 
scale.  A  major  barrier  to  the  application  of  all  quantum  chem¬ 
istry  methods  is  their  steep  computational  cost.  As  a  result, 
high-performance  computers  have  been  used  for  computational 
quantum  chemistry  for  more  than  40  years  and  enormous  effort 
has  been  put  into  designing  algorithms  and  developing  soft¬ 
ware  that  enables  the  efficient  use  of  such  resources.  Among 
the  most  common  methods  of  quantum  chemistry  are  quantum 
many-body  (QMB)  methods,  which  attempt  to  explicitly  solve 
the  Schrodinger  equation  using  a  variety  of  ansatze.  The 
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explicit  treatment  of  electrons  in  molecules  leads  to  a  steep 
computational  cost,  which  is  nonetheless  often  of  polynomial 
complexity,  but  with  the  benefit  of  systematic  improvement 
via  a  series  of  related  ansatze.  The  Coupled  Cluster  (CC) 
family  of  methods  [1],  [2]  is  currently  the  most  popular  QMB 
method  in  chemistry  due  to  its  high  accuracy,  polynomial 
time  and  space  complexity,  and  systematic  improvability.  This 
paper  focuses  on  the  fundamental  kernels  of  Coupled  Cluster 
-  tensor  contractions  -  and  demonstrates  a  completely  new 
algorithmic  approach  that  has  the  potential  to  enable  these 
applications  on  state-of-the-art  architectures  while  achieving  a 
high  degree  of  efficiency  in  computation,  communication  and 
storage. 

We  present  a  general  parallel  decomposition  of  tensors  that 
efficiently  decomposes  packed  tensor  data  with  any  partial 
or  full  tensor  symmetry  of  any  dimension.  Our  mapping 
algorithms  rearrange  the  tensor  layouts  to  suit  any  given  tensor 
contraction.  This  tensor  decomposition  needs  minimal  padding 
and  has  a  regular  decomposition  that  allows  the  algorithm  to 
be  mapped  to  a  physical  network  topology  and  executed  with 
no  load  imbalance.  Since  this  decomposition  is  completely 
regular,  we  automatically  search  over  many  possible  mappings 
and  decompositions  at  run-time,  explicitly  calculating  the 
memory  usage,  communication  cost,  and  padding  they  require 
and  selecting  the  best  one.  Our  mapping  framework  considers 
mappings  that  unpack  or  replicate  the  tensor  data,  which  yields 
theoretically  optimal  communication  cost. 

We  implemented  these  mapping  algorithms  in  Cyclops  Ten¬ 
sor  Framework  (CTF),  a  distributed  tensor  contraction  library. 
We  expose  the  generality  of  this  framework  via  an  elegant 
interface  that  closely  corresponds  to  Einstein  notation,  capable 
of  performing  arbitrary  dimensional  contractions  on  symmetric 
tensors.  This  interface  is  a  domain  specific  language  well- 
suited  for  theoretical  chemists.  To  demonstrate  correctness  and 
performance  we  implemented  a  Coupled  Cluster  method  with 
single  and  double  excitations  using  this  infrastructure. 

The  contributions  of  this  paper  are 

•  a  communication-optimal  tensor  contraction  algorithm 

•  a  cyclic  tensor  decomposition  for  symmetric  tensors 

•  an  automatic  topology-aware  mapping  framework 


•  a  load-balanced  virtualization  scheme 

•  a  scalable  implementation  of  Coupled  Cluster 

In  this  paper,  we  will  start  by  giving  a  brief  overview  of 
Coupled  Cluster,  then  detail  related  work.  We  define  the 
Coupled  Cluster  Singles  and  Doubles  (CCSD)  equations  that 
are  realized  by  our  implementation.  After  presenting  this 
theory,  we  will  discuss  the  algorithm  we  use  to  parallelize 
the  CCSD  tensor  contractions.  The  implementation  of  the 
algorithm,  Cyclops  Tensor  Framework,  will  be  detailed.  We 
will  present  Blue  Gene/Q  and  Cray  results  for  CCSD  running 
on  top  of  CTF,  which  show  good  weak  scaling  and  outperform 
NWChem  [3], 

II.  Background 

Coupled  Cluster  (CC)  is  a  method  for  computing  an  approx¬ 
imate  solution  to  the  time-independent  Schrodinger  equation 
of  the  form 

H|tf>  =.E|*>, 

where  H  is  the  Hamiltonian,  E  is  the  energy,  and  T  is 
the  wave-function.  In  CC,  the  approximate  wave-function  is 
defined  in  exponential  form 

I*)  =eT|$0) 

where  <I>o)  is  a  one-electron  reference  wave-function,  usually 
a  Hartree-Fock  Slater  determinant.  The  T  operator  in  CC  has 
the  form 

T  =  Tx  +  T2  +  T3  . . . 

^  aa1  ■  ■  ■  aanain  ■  •  •  ail 

a\...an 

where  Tn  =  {t°' is  a  2 nth  rank  (dimension)  2  tensor 
representing  the  set  of  amplitudes  for  all  possible  excitations 
of  n  electrons  from  occupied  orbitals  in  the  reference  to  virtual 
(unoccupied)  orbitals.  Each  Tn  is  computed  via  a  series  of 
tensor  contractions  on  tensors  of  rank  r  £  {2, 4, . . .  2 n  +  2}. 
The  specific  tensor  contractions  depend  on  the  variation  of 
CC  and  can  be  derived  by  various  algebraic  or  diagrammatic 
methods  [4],  Using  the  truncated  operator  T  =  Ti  +  T2 
gives  the  method  commonly  known  as  CCSD  (Coupled  Cluster 
Singles  and  Doubles)  [5],  Removal  of  Ti  gives  the  basic  CCD 
method,  while  further  addition  of  T3  gives  the  CCSDT  (T 
-  triples)  method  [6],  [7]  and  T4  gives  the  CCSDTQ  (Q  - 
quadruples)  method  [8], 

Computationally,  tensor  contractions  can  be  reduced  to 
matrix  multiplication  via  index  reordering  (transposes).  This 
approach  is  efficient  and  commonly  used  for  contractions 
on  fully  dense  tensors.  However,  the  tensors  which  arise  in 
CC  usually  have  high-dimensional  structure.  In  particular, 
permutational  symmetry  or  skew-symmetry  (anti-symmetry) 
among  a  set  of  indices  implies  that  any  reordering  of  the  index 
set  within  the  tensor  will  give  the  same  value  (with  a  potential 
sign  change  for  anti-symmetry).  For  example,  elements  of  the 
2-particle  Hamiltonian  v®b  are  skew-symmetric  in  a,  b  and  in 

2We  will  use  the  term  dimension  to  refer  to  tensor  rank  or  order. 


i,j.  This  permutation  symmetry  arises  from  the  requirement 
that  the  wave-function  for  fermions  (bosons)  be  antisymmetric 
(symmetric)  under  the  interchange  of  particles.  So,  we  have 

vab  =  -vba  =  vba  =  -vab 
u?  u?  3'1  3l 

where  the  unique  part  of  the  tensor  needs  to  be  stored  is 
The  coupled  cluster  amplitudes  Tn  are  also  skew-symmetric 
for  some  spins,  and  can  have  symmetries  among  groups  of  n 
indices. 

In  general,  permutational  symmetry  of  n  indices  implies  that 
only  one  of  every  n!  values  in  the  full  tensor  is  unique.  This 
implies  that  it  suffices  to  store  only  1/n!  of  the  tensor  data.  In 
higher-order  methods  such  as  CCSDT  and  CCSDTQ,  which 
have  3-dimensional  and  4-dimensional  symmetries  as  well  as 
multiple  symmetric  index  groups  in  some  tensors,  the  storage 
reduction  provided  by  exploiting  symmetry  is  significant  (4- 
36  times  less  for  various  tensors  in  CCSDT  and  16-576  times 
less  for  CCSDTQ).  Further,  any  symmetry  preserved  within 
a  contraction  (e.g.  the  output  C  contains  indices  that  were 
symmetric  in  operands  A  or  B),  reduces  the  computational 
cost  relative  to  a  non-symmetric  contraction. 

The  challenge  in  exploiting  high-dimensional  symmetry  is 
that  the  contractions  can  no  longer  be  trivially  reduced  to  dense 
matrix  multiplication.  Further,  since  the  number  of  possible  as 
well  as  encountered  (partial)  permutational  symmetries  grows 
factorially  with  tensor  dimension,  it  is  difficult  to  generalize 
and  tiresome  to  specialize.  As  a  result,  most  implementations 
exploit  tensor  symmetry  to  a  limited  extent  and  perform 
redundant  work  and  communication  by  unpacking  or  padding 
tensors. 

III.  Previous  work 

We  provide  an  overview  of  existing  applications  and  known 
algorithms  for  distributed  memory  CC  and  tensor  contractions. 
We  also  discuss  parallel  numerical  linear  algebra  algorithms, 
in  particular  2.5D  algorithms  [9],  which  will  serve  as  a  key 
motivation  for  the  design  of  Cyclops  Tensor  Framework. 

A.  NWChem  and  TCE 

NWChem  [3]  is  a  computational  chemistry  software  pack¬ 
age  developed  for  massively  parallel  systems.  NWChem  in¬ 
cludes  implementations  of  CC  and  tensor  contractions,  which 
are  of  interest  in  our  analysis.  We  will  detail  the  paralleliza¬ 
tion  scheme  used  inside  NWChem  and  use  it  as  a  basis  of 
comparison  for  the  Cyclops  Tensor  Framework  design. 

NWChem  uses  the  Tensor  Contraction  Engine  (TCE)  [10], 
[11],  [12],  to  automatically  generate  sequences  of  tensor 
contractions  based  on  a  diagrammatic  representation  of  CC 
schemes.  TCE  attempts  to  form  the  most  efficient  sequence  of 
contractions  while  minimizing  memory  usage  of  intermediates 
(computed  tensors  that  are  neither  inputs  nor  outputs).  We  note 
that  TCE  or  a  similar  framework  can  function  with  any  dis¬ 
tributed  library  which  actually  executes  the  contractions.  Thus, 
TCE  can  be  combined  with  Cyclops  Tensor  Framework  since 
they  are  largely  orthogonal  components.  However,  the  tuning 
decisions  done  by  such  a  contraction-generation  layer  should 


be  coupled  with  performance  and  memory  usage  models  of 
the  underlying  contraction  framework.  In  addition,  one  of  the 
present  authors  is  working  on  a  new,  more  flexible  generator 
for  CC  contractions  which  could  be  more  tightly  coupled  to 
CTF. 

To  parallelize  and  execute  each  individual  contraction, 
NWChem  employs  the  Global  Arrays  (GA)  framework  [13], 
Global  Arrays  is  a  partitioned  global-address  space  model 
(PGAS)  and  allows  processors  to  access  (fetch)  data  which 
may  be  laid  out  physically  on  a  different  processor.  Data 
movement  within  GA  is  performed  via  one-sided  communica¬ 
tion,  thereby  avoiding  synchronization  among  communicating 
nodes,  while  fetching  distributed  data  on-demand.  NWChem 
performs  different  block  tensor  sub-contractions  on  all  pro¬ 
cessors  using  GA  as  the  underlying  communication  layer  to 
satisfy  dependencies  and  obtain  the  correct  blocks.  Since  this 
dynamically  scheduled  scheme  is  not  load  balanced,  NWChem 
uses  dynamic  load  balancing  among  the  processors.  Further, 
since  distribution  is  hidden  by  GA,  the  communication  pattern 
is  irregular  and  possibly  unbalanced.  Cyclops  Tensor  Frame¬ 
work  attempts  to  eliminate  the  scalability  bottlenecks  of  load 
imbalance  and  irregular  communication,  by  using  a  regular 
decomposition  which  employs  a  structured  communication 
pattern  well-suited  for  torus  network  architectures. 

B.  ACES  III  and  SIAL 

The  ACES  III  package  uses  the  SIAL  framework  [14],  [15] 
for  distributed  memory  tensor  contractions  in  coupled-cluster 
theory.  Like  the  NWChem  TCE,  SIAL  uses  tiling  to  extract 
parallelism  from  each  tensor  contraction.  However,  SIAL  has 
a  different  runtime  approach  that  does  not  require  active- 
messages,  but  rather  uses  intermittent  polling  (between  tile 
contractions)  to  respond  to  communication  requests,  so  SIAL 
can  be  implemented  using  MPI  two-sided  communication.  To- 
date,  ACES  III  has  not  implemented  arbitrary-order  tensor 
contractions  or  methods  beyond  CCSD(T),  so  no  direct  com¬ 
parison  can  be  made  for  such  cases. 

C.  MRCC 

MRCC  [16]  is  a  program  suite  which  performs  arbitrary- 
order  calculations  for  a  variety  of  CC  and  related  methods. 
Parallelism  is  enabled  to  a  limited  extent  by  either  using  a 
multi-threaded  BLAS  library  or  by  parallel  MPI  features  of 
the  program.  However,  the  scaling  performance  is  severely 
limited  due  to  highly  unordered  access  of  the  data  and  exces¬ 
sive  inter-node  communication.  MRCC  is  currently  the  only 
tenable  solution  for  performing  any  type  of  CC  calculation 
beyond  CCSDTQ,  and  the  lack  of  scalability  presents  a  serious 
bottleneck  in  many  calculations. 

MRCC  uses  a  string-based  approach  to  tensor  contractions 
which  originated  in  the  development  of  Full  Cl  codes.  In 
this  method,  the  tensors  are  stored  using  a  fully-packed 
representation,  but  must  be  partially  unpacked  in  order  for 
tensor  contractions  to  be  performed.  The  indices  of  the  tensors 
are  then  represented  by  index  “strings”  that  are  pre-generated 
and  then  looped  over  to  form  the  final  product.  The  innermost 


loop  contains  a  small  matrix- vector  multiply  operation  (the 
dimensions  of  this  operation  are  necessarily  small,  and  become 
smaller  with  increasing  level  of  excitation  as  this  loop  involves 
only  a  small  number  of  the  total  indices).  The  structured 
communication,  storage,  and  contractions  algorithms  that  we 
propose  in  the  Cyclops  Tensor  Framework  could  then  present 
a  significant  improvement  in  both  raw  efficiency  and  paral¬ 
lelization  of  tensor  contractions  relative  to  MRCC. 


D.  Lower  bounds  on  communication 


Since  tensor  contractions  are  closely  related  to  matrix 
multiplication  (MM),  it  is  of  much  interest  to  consider  the 
best  known  distributed  algorithms  for  MM.  Ideally,  the  per¬ 
formance  achieved  by  any  given  tensor  contraction  should 
approach  the  efficiency  of  matrix  multiplication,  and  generally 
the  latter  is  an  upper-bound.  In  particular,  we  would  like  to 
minimize  the  communication  (number  of  words  of  data  moved 
across  the  network  by  any  given  processor)  done  to  contract 
tensors.  Since  any  operation  within  a  tensor  contraction  maps 
to  three  tensors  the  same  lower  bound  argument  that  works 
for  matrix  multiplication  applies  to  tensor  contractions. 

Given  a  matrix  multiplication  or  contraction  that  requires 
F/p  multiplications  on  p  processors,  with  M  words  of  mem¬ 
ory  on  each  processor,  it  is  known  that  some  processor  must 
communicate  at  least 


w  =  n 


(1) 


words  of  data  [17],  [18],  [19].  If  the  tensor  or  matrices  are  of 
size  S  =  0(M  •  p),  the  communication  lower  bound  is 


We  label  this  lower  bound  as  W2D  because  it  is  achieved 
by  matrix  multiplication  algorithms  that  are  most  naturally 
described  on  a  2D  processor  grid.  In  particular,  blocked 
Cannon’s  algorithm  [20]  and  SUMMA  [21],  [22]  achieve 
this  communication  bandwidth  lower  bound.  We  can  also  see 
that,  assuming  the  initial  data  is  not  replicated  and  load- 
balanced,  there  is  an  absolute  (memory-size  insensitive)  lower- 
bound  [23],  [24], 

w3D  =  n  (  F  - 

\p2/3-VS  pj 

This  communication  lower-bound  can  be  achieved  by  per¬ 
forming  3D  blocking  on  the  computational  graph  rather  than 
simply  distributing  the  matrices.  An  old  algorithm  known 
as  3D  matrix  multiplication  has  been  shown  to  achieve  this 
communication  cost  [25],  [21],  [23],  [26], 

However,  in  practice,  most  applications  run  with  some 
bounded  amount  of  extra  available  memory.  2.5D  algorithms 
minimize  communication  cost  for  any  amount  of  physical 
memory.  In  particular,  if  all  operand  tensors  or  matrices  are  of 
size  S  =  0(M  • p/c ),  where  c  €  [1  ,p1/3],  the  communication 
lower  bound  is 

W2.5  d  = 


\Jp-  c  •  S  p 


Using  adaptive  replication  this  communication  lower-bound 
can  be  achieved  for  matrix  multiplication  as  well  as  other 
dense  linear  algebra  kernels  via  the  algorithms  presented  in  [9]. 
Its  also  important  to  note  that  2.5D  algorithms  can  map 
very  efficiently  to  torus  network  architectures  as  demonstrated 
in  [27].  We  demonstrate  an  algorithm  and  an  implementation 
of  a  tensor  contraction  framework  that  does  no  more  commu¬ 
nication  for  each  contraction  than  these  lower  bounds. 

IV.  Arbitrary-order  Coupled  Cluster 


spin-/3,  abij ....  For  example,  the  second  contraction  above 
becomes. 
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Coupled  Cluster  is  an  iterative  process,  where  in  each 
iteration,  the  new  set  of  amplitudes  T'  are  computed  from 
the  amplitudes  from  the  previous  iteration  T  and  from  the 
Hamiltonian  H  =  F  +  V.  The  diagonal  elements  of  the  one- 
particle  Hamiltonian  F  are  separated  out  as  a  factor  D,  giving 
a  final  schematic  form  similar  to  a  standard  Jacobi  iteration 
(although  V  still  contains  diagonal  elements). 


T'  =  D 


(F'  +  V)(l 
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The  expansion  of  the  exponential  operator  is  complete  at 
fourth  order  due  to  the  fact  that  the  Hamiltonian  includes 
only  one-  and  two-particle  parts.  The  specific  tensors  which 
compose  F',  V,  and  T  are 
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where  the  abcdef . . .  indices  refer  to  virtual  orbitals  while 
ijklmn . . .  refer  to  occupied  orbitals.  The  contractions  which 
must  be  done  can  be  derived  using  either  algebraic  or  dia¬ 
grammatic  techniques,  however  the  result  is  a  sequence  of 
contractions  such  as 


zT  = 


2  2—i  ef  ' or 
e  fm 
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Contractions  which  involve  multiple  T  tensors  are  factored 
into  a  sequence  of  contractions  involving  one  or  more  interme¬ 
diates,  such  that  each  contraction  is  a  binary  tensor  operation. 

The  equations,  as  written  above,  are  termed  the  “spin- 
orbital”  representation  in  that  the  indices  are  allowed  to  run 
over  orbitals  of  either  a  or  /3  spin,  while  only  amplitudes 
with  certain  combinations  of  spin  are  technically  allowed. 
Some  programs  use  this  representation  directly,  checking  each 
amplitude  or  block  of  amplitudes  individually  to  determine  if 
it  is  allowed  (and  hence  should  be  stored  and  operated  upon). 
However,  an  alternative  approach  is  the  use  the  spin-integrated 
equations  where  each  index  is  explicitly  spin-a,  abij ....  or 


While  the  number  of  contractions  is  increased,  the  total 
amount  of  data  which  must  be  stored  and  contracted  is 
reduced  compared  to  a  naive  implementation  of  the  spin- 
orbital  method,  and  without  the  overhead  of  explicit  spin¬ 
checking. 

The  amplitudes  (and  /  and  v  integrals)  have  implicit  per- 
mutational  symmetry.  Indices  which  appear  together  (meaning 
either  both  upper  or  lower  indices  of  a  tensor)  and  which  have 
the  same  spin  and  occupancy  may  be  interchanged  to  produce 
an  overall  minus  sign.  In  practice  this  allows  the  amplitudes 
to  be  stored  using  the  symmetric  packing  facilities  built  into 
CTF. 

A.  Interface  for  Tensor  Operations 

Cyclops  Tensor  Framework  provides  an  intuitive  domain 
specific  language  for  performing  tensor  contractions  and  other 
tensor  operations.  This  interface  is  implemented  using  operator 
overloading  and  templating  in  C++,  with  the  end  result  that 
tensor  contractions  can  be  programmed  in  the  exact  same 
syntax  as  they  are  defined  algebraically, 

ab  _  \  '  mnj.ef  ,ab 

Zi j  ~  2-,  Uef  lij  lmn 
efmn 

W[“MnI j”]  =  V[“MnEf ”]  *T[“EfIj”]; 

Z[“AbI j”]  =  W[“MnIj”]  *T[“AbMn”]; 

This  interface  naturally  supports  all  types  of  tensor  operations, 
not  just  contraction.  The  number  and  placement  of  the  unique 
indices  implicitly  defines  the  operation  or  operations  which 
are  to  be  performed.  For  example,  the  repetition  of  an  index 
within  an  input  tensor  which  does  not  appear  in  the  output 
tensor  defines  a  trace  over  that  index.  Similarly,  an  index 
which  appears  in  all  three  tensors  defines  a  type  of  “weighting” 
operation  while  an  index  which  appears  multiple  times  in  the 
input  and  once  in  the  output  will  operate  on  diagonal  or  semi- 
diagonal  elements  of  the  input  only.  The  weighting  operation 
deserves  special  attention  as  it  is  required  in  CC  to  produce 
the  new  amplitudes  T'  from  Z  =  HeT, 

T'  =  D  4Z 

T[“AbIj”]  =  Dinv[“AbI j”]  *  Z[“AbIj”]; 


Additionally,  Equation-of-Motion  CC  (EOM-CC)  and  many 
other  related  techniques  have  terms  that  require  computation 
of  only  the  diagonal  elements  of  a  tensor  contraction  or  require 
replication  of  the  result  along  one  or  more  dimensions,  both  of 
which  can  be  expressed  easily  and  succinctly  in  this  interface. 
For  example,  the  diagonal  tensor  elements  used  in  EOMIP- 
CCSD  include  terms  such  as, 

pjaij 
aij 
jjaij 
aij 

which  can  be  expressed  in  CTF  as, 

Hbar[“AIj”]  +=  W[“ljlj”]; 

Hbar[“AIj”]  +=  V[“IjAe”]  *  T[“AeIj”]; 

B.  Application  to  CCSD 

The  CCSD  model,  where  T  =  Ti  +  T2,  is  one  of  the 
most  widely  used  coupled  cluster  methods  as  it  provides  a 
good  compromise  between  efficiency  and  accuracy,  and  is 
fairly  straightforward  to  derive  and  implement.  In  particular, 
CCSD  is  only  slightly  more  computationally  expensive  than 
the  simpler  CCD  method  [28]  but  provides  greater  accuracy, 
especially  for  molecular  properties  such  as  the  gradient  and 
those  derived  from  response  theory.  Formally,  CCD  and  CCSD 
have  the  same  leading-order  cost:  O{n20n „),  where  n0  and  nv 
are  the  number  of  occupied  and  virtual  orbitals,  respectively. 

The  spin-orbital  equations  for  CCSD  are  relatively  simple, 
and  are,  in  factorized  form. 
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Recasting  these  equations  in  terms  of  spin-integrated  quanti¬ 
ties  adds  additional  contractions.  Most  terms  in  the  expression 
for  zf  and  are  unchanged  except  for  the  addition  of 
spin  labels  and  a  change  in  overall  factor.  However,  terms 
which  contain  permutation  operators  become  more  compli¬ 
cated,  especially  term  3  of  zft  (and  the  equation  for  its 
intermediate).  The  symmetry  properties  of  the  amplitudes 
allow  for  some  simplification,  though,  as  the  explicit  permu¬ 
tation  operators  can  be  implied  by  the  symmetry  relations  in 
the  specification  of  the  output  tensor.  For  example,  if  two 
antisymmetric  matrices  are  multiplied  together  and  the  result 
is  then  antisymmetrized,  the  result,  in  terms  of  fully  packed 
storage  requires  six  separate  operations.  All  of  these  operations 
can  be  represented  by  a  single  contraction  call  in  the  CTF 
interface  if  the  output  tensor  is  specified  to  have  antisymmetry. 


C[ab\  =  P^A[ac]  x  B[cb], 

C[a  <  b]  =  E  1^1°  <  c\B[c  <  b]  —  A[a  <  c]B[b  <  c] 

C 

— A[c  <  a]B[c  <  b]  —  A[b  <  c\B[c  <  a] 
+A[b  <  c]B[a  <  c]  +  A[c  <  b\B[c  <  a]} 

/*  A,  B,  and  C  are  antisymmetric  */ 
C[“ab”]  =A[“ac”]*B[“cb”]; 

An  interface  layer  to  automatically  produce  the  necessary  spin- 
integrated  contractions  has  also  been  implemented,  so  that  the 
code  can  be  written  entirely  in  terms  of  the  simple  spin-orbital 
quantities.  With  these  simplifications,  the  total  amount  of  code 
to  perform  a  single  CCSD  iteration  is  only  41  lines. 


C.  Higher-order  Coupled  Cluster 

Higher  order  CC  methods  (CCSDT,  CCSDTQ,  CCSDTQR 
etc.)  are  theoretically  very  similar  to  CCD  and  CCSD,  how¬ 
ever,  several  important  computational  distinctions  arise.  First, 
as  the  order  increases,  the  highest  set  of  Tn  amplitudes  grows 
relatively  much  larger  than  the  Hamiltonian  elements  and  the 


other  T  amplitudes.  The  computation  time  in  terms  of  FLOPS 
is  dominated  by  a  handful  of  contractions  involving  this  largest 
amplitude  set.  However,  the  sheer  number  of  small  contrac¬ 
tions  which  must  be  done  in  addition  can  instead  dominate 
the  wall  time  if  they  are  not  performed  as  efficiently  or  do 
not  parallelize  as  well.  Thus,  the  efficiency  of  small  tensor 
contractions  and  strong-scalability  of  the  parallel  algorithm 
become  relatively  much  more  important  for  higher  order  CC. 

Second,  since  the  total  memory  and/or  disk  space  available 
for  the  computation  is  effectively  constant,  high  orders  of  CC 
necessitate  the  use  of  a  smaller  number  of  occupied  and  virtual 
orbitals.  This  shrinks  the  length  of  each  tensor  dimension, 
threatening  vectorization  and  increasing  indexing  overhead. 
CTF  currently  uses  a  sequential  contraction  kernel  which  uses 
a  cyclic  blocking  with  a  fixed  tile  size  to  avoid  vectorization 
problems  for  packed  tensors.  While  extremely  short  edge 
lengths  could  still  cause  excessive  overhead  in  this  scheme 
(due  to  the  padding  needed  to  fill  out  the  fixed-length  tiles), 
good  performance  should  be  retained  in  most  circumstances. 


V.  Tensor  decomposition  and  contraction 
We  define  a  tensor  contraction  between  A  £  B  £ 
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Tensor  contractions  reduce  to  matrix  multiplication  via  index 
folding.  We  define  a  folding 


|*1  •  •  -in\ 
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for  instance  \ijk\  =  {i,j,  k}  — >  i  +  I\  ■  j  +  I\  ■  •  k.  Any 

contraction  can  be  folded  into  matrix  multiplication  in  the 
following  manner. 


\jl—jk+l-m\ 

So  here  A,  B,  and  C  can  be  treated  simply  as  matrices,  albeit, 
in  general,  the  index  ordering  may  have  to  be  transposed. 
Tensors  can  also  have  symmetry,  we  denote  antisymmetric 
(skew-symmetric)  index  groups  as 
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for  any  j,k  £  [l,n],  For  the  purpose  of  this  analysis,  we  will 
only  treat  antisymmetric  tensors,  for  symmetric  matrices  the 
non-zero  diagonals  require  more  special  consideration.  Using 
the  notation  7®n  =  I  x  •  •  •  x  /,  we  denote  a  packed  (folded) 

(n— 1) -times 

antisymmetric  layout  as  a  map  from  an  index  to  a  interval  of 
size  binomial  in  the  tensor  edge  length 


|*1<*2  <...*n|  =  {/»"}->  1: 


so  given  a  simple  contraction  of  antisymmetric  tensors,  such 
as. 


31  ■■■Ok+l-m 

we  can  compute  it  in  packed  antisymmetric  layout  via 


+  (k  +  l  m)\  ■  'y  ] 


The  above  contraction  is  an  example  where  all  symmetries  are 
preserved.  Any  preserved  symmetries  must  be  symmetries  of 
the  contraction  graph  G  =  ( V.  E )  where  vertices  are  triplets 
defined  by. 


lVai...akbi...bici...cTn  —  (t*Q 


!  bbi  ...bi ;  Cci  ...Cm  )  * 


Broken  symmetries  are  symmetries  which  exists  in  one  of  A, 
B,  or  C,  but  not  in  G.  For  example,  we  can  consider  the 
contraction 

C[ij]kl  =  EQr».7lfpol  '  bpk[ql] 

pq 

which  corresponds  to  contraction  graph  elements  V\i:Akipq. 
The  symmetry  [ij]  is  preserved  but  the  symmetries  [pq]  and 
[ql]  are  broken.  While  each  preserved  contraction  allows  a 
reduction  in  floating  point  operations,  broken  contractions 
allow  only  preservation  of  storage.  The  broken  symmetries 
can  be  unpacked  and  the  contraction  computed  as 

C\i<j\kl  ~  ^  *  y  '  &\i<j\pq  *  bpkql 
pq 

or  the  broken  symmetries  can  remain  folded,  in  which  case 
multiple  permutations  are  required, 

c\i<j\kl  =  2  y  '  a|i<j||p<(j|  *  bpk\q<l\  —  a\i<j\\p<q\  '  ^pfe|l<g| 

\p<q\ 

~a\i<j\\q<p\  '  bpk\q<l\  +  a\i<j\\q<p\  '  ^ pk\l<q\ 

Our  framework  makes  dynamic  decisions  to  unpack  broken 
symmetries  in  tensors  or  to  perform  the  packed  contraction 
permutations,  based  on  the  amount  of  memory  available.  We 
will  show  that  in  either  case,  our  approach  is  communication- 
optimal.  All  preserved  symmetries  are  always  kept  in  packed 
layout,  so  no  extra  computation  is  preformed. 


A.  Cyclic  tensor  decomposition 

A  blocked  distribution  implies  each  processor  owns  a  con¬ 
tiguous  piece  of  the  original  tensor.  In  a  cyclic  distribution,  a 
cyclic  phase  defines  the  periodicity  of  the  set  of  indices  whose 
elements  are  owned  by  a  single  processor.  For  example,  if 
a  vector  is  distributed  cyclically  among  4  processors,  each 
processor  owns  every  fourth  element  of  the  vector.  For  a 
tensor  of  dimension  d,  we  can  define  a  set  of  cyclic  phases 
(p0,Pi,  ■  ■  ■  ,Pd- 1),  such  that  processor  Pioti1,...,id_1  owns  all 
tensor  elements  whose  index  (jo,ji,  ■  ■  ■  ,jd-i)  satisfies 


jk  =  ifcmod(pfc) 
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Fig.  1.  The  load-imbalance  incurred  or  padding  necessary  for  blocked,  block-cyclic,  and  cyclic  layouts. 


for  all  k  £  {0, 1 ,d}.  A  block-cyclic  distribution  general¬ 
izes  blocked  and  cyclic  distributions,  by  distributing  contigu¬ 
ous  blocks  of  any  size  b  cyclically  among  processors.  Cyclic 
decompositions  are  commonly  used  in  parallel  numerical  lin¬ 
ear  algebra  algorithms  and  frameworks  such  as  ScaLAPACK 
(block-cyclic)  [29]  and  Elemental  (cyclic)  [30].  Our  method 
extends  this  decomposition  to  tensors. 

Like  matrix  multiplication,  tensor  contractions  are  invariant 
with  respect  to  a  similarity  permutation  on  A  and  B, 

PCPT  =  PA  ■  BPt  =  ( PAPt )  ■  ( PBPt ) 

This  invariance  means  that  we  can  permute  the  ordering  of 
rows  in  columns  in  a  matrix  or  slices  of  a  tensor,  so  long 
as  we  do  it  to  both  A  and  B  and  permute  PCPT  back 
to  C.  This  property  is  particularly  useful  when  considering 
cyclic  and  blocked  distributions  of  matrices  and  tensors.  We 
can  define  a  permutation,  P,  that  permutes  a  tensor  elements 
from  a  blocked  to  a  cyclic  layout.  Conversely,  we  can  run 
an  algorithm  on  a  cyclic  distribution  and  get  the  answer  in  a 
blocked  distribution  by  applying  a  permutation,  or  run  exactly 
the  same  algorithm  on  a  blocked  distribution  and  get  the  cyclic 
answer  by  applying  a  permutation. 

The  main  idea  behind  Cyclops  Tensor  Framework  is  to 
employ  a  cyclic  distribution  to  preserve  packed  symmetric 
structure  in  sub-tensors,  minimize  padding,  and  generate  a 
completely  regular  decomposition,  susceptible  to  classical  lin¬ 
ear  algebra  optimizations.  Each  processor  owns  a  cyclic  sub¬ 
tensor,  where  the  choice  of  cyclic  phases  in  each  dimension 
has  the  same  phase  for  all  symmetric  indices.  By  maintaining 
the  same  cyclic  phase  in  each  dimension,  the  algorithm  ensures 
that  each  the  sub-tensor  owned  by  any  processor  has  the 
same  structure  and  structure  as  the  whole  tensor.  Further, 
minimal  padding  on  each  sub-tensor  allows  for  every  sub¬ 
tensor  to  have  the  exact  same  shape  only  with  different  entries. 
Figure  1  demonstrates  the  difference  in  padding  (or  load- 
imbalance)  required  to  store  exactly  the  same  sub-tensors  on 
each  processor.  It  is  evident  that  only  a  cyclic  layout  can 
preserve  symmetry  as  well  as  maintain  load  balance.  Overall 
the  amount  of  padding  required  for  CTF  is  equivalent  to  setting 
the  block  size  b  =  p1'1 d,  since  we  must  add  up  the  padding  on 
each  processor. 


B.  Distributed  contraction  of  tensors 

Our  decomposition  gives  us  tensors  in  cyclic  layouts  dis¬ 
tributed  over  a  torus  topology.  These  tensors  can  be  replicated 
over  some  dimensions,  which  means  blocking  of  all  indices  is 
done.  The  distributed  algorithm  for  tensor  contractions  can  be 
efficiently  defined  as  a  generalized  nested  SUMMA  algorithm. 
If  the  dimensions  of  two  tensors  with  the  same  contraction 
index  are  mapped  onto  different  torus  dimensions,  a  SUMMA 
algorithm  is  done  on  the  plane  defined  by  the  two  torus 
dimensions.  For  each  pair  of  indices  mapped  in  this  way,  a 
nested  level  of  SUMMA  is  done. 

The  communication  cost  of  the  recursive  SUMMA  al¬ 
gorithm  is  asymptotically  the  same  as  a  single  SUMMA 
algorithm  done  on  a  matrix  of  the  same  size  as  the  tensor 
distributed  on  a  2D  network  with  the  same  number  of  nodes 
as  the  higher-dimensional  torus  network.  SUMMA  is  known 
to  be  communication-optimal  given  that  no  extra  memory  is 
utilized  (weak-scaling  regime).  Thus,  our  recursive  distributed 
contraction  algorithm  is  also  communication-optimal  for  a 
large  problem  size.  To  get  optimality  for  irregularly  shaped 
tensors,  we  also  make  the  dynamic  choice  of  which  pair  of 
tensors  needs  to  be  communicated  (multicasts  must  be  done 
on  A  and  B ,  and  reductions  on  C ). 

However,  we  are  also  interested  in  strong  scaling,  in  order 
to  compute  small  tensor  contractions  rapidly.  Such  efficiency 
is  necessary  for  higher  order  CC  methods  which  require  many 
different  contractions.  To  do  strong  scaling,  one  computes 
a  problem  of  the  same  size  on  more  processors  and  aims 
to  reduce  the  time  to  solution.  In  this  scaling  regime,  more 
memory  must  be  available  than  the  amount  necessary  to  store 
the  tensor  operands  and  output.  Therefore,  we  can  replicate 
tensor  data  and  avoid  communication.  We  always  replicate  the 
smallest  one  of  the  three  tensors  involved  in  the  contraction  to 
minimize  the  amount  of  memory  and  communication  overhead 
of  replication. 

By  taking  consideration  of  the  size  of  all  of  the  tensors 
and  doing  replication  up  to  the  given  memory  constraints,  we 
obtain  an  algorithm  that  partitions  the  tensor  data  in  a  com¬ 
munication  optimal  fashion.  In  particular,  if  we  must  compute 
F  multiplies  to  do  a  contraction,  we  exploit  as  much  memory 
as  possible  and  select  an  algorithm  that  minimizes  the  amount 


of  data  communicated  for  each  tensor.  Since  we  exploit  the 
maximum  replication  and  employ  an  optimal  algorithm  for 
contraction  along  each  pair  of  indices  the  communication 
bandwidth  cost  comes  out  to  be 

W  =  0  ( — 1-7=  -  M) 

\p-Vm  J 

where  M  is  the  memory.  This  —  M  term  is  achieved  by 
avoiding  the  migration  of  input  tensors  wherever  possible.  This 
communication  cost  matches  the  bandwidth  lower  bound  for 
matrix  multiplication  (Equation  1).  If  the  contraction  involves 
r  symmetric  permutations  due  to  broken  symmetries, 

W  =  o(r-  ( JUj—  -  m])  =  O  (  F '  -  tm]  . 

\  \p-y/M  JJ  \p-VM  J 

Evidently,  performing  symmetric  premutations  is  useful  only 
if  the  data-input  size  approaches  the  total  data-movement  cost 
for  the  contraction,  which  is  uncommon.  Note  that  if  adaptive 
replication  was  not  done,  performing  unpacking  could  be  faster 
than  doing  permutations,  since  the  unpacked  contraction  would 
effectively  utilize  more  available  memory. 

C.  On-node  contraction  of  tensors 

To  perform  the  on-node  contraction,  we  perform  non- 
symmetric  transposes  of  the  tensors.  In  particular,  we  move  all 
dimensions  which  do  not  correspond  to  groups  of  symmetric 
indices  whose  symmetry  is  broken  within  the  contraction.  If 
symmetries  are  not  broken,  we  can  simply  fold  the  symmetric 
indices  into  one  bigger  dimension  linearizing  the  packed 
layout.  We  perform  an  ordering  transposition  on  the  local 
tensor  data  to  move  linearized  dimensions  forward  and  the 
broken  symmetric  dimensions  in  the  back  of  the  tensors.  To 
do  a  sequential  contraction,  we  can  then  iterate  over  the  broken 
symmetric  indices  (or  unpack  the  symmetry)  and  call  matrix 
multiplication  over  the  linearized  indices.  For  instance,  the 
contraction  from  the  start  of  this  section, 

c[ij]kl  =  ^  a[ij}\pq]  '  b\pqk][rl\ 
pqr 

would  be  done  as  a  single  matrix  multiplication  for  each 
block,  if  all  the  broken  symmetries  are  unpacked.  However,  if 
all  the  broken  symmetries  are  kept  folded,  the  nonsymmetric 
transpose  would  push  forward  the  folded  index  corresponding 
to  \i  <  j\,  so  that  the  sequential  kernel  could  iterate  over  pqkrl 
and  call  a  scale  operation  for  each  |  i  <  j  \. 

VI.  Automatic  mapping  of  contractions 

Each  contraction  can  place  unique  restrictions  on  the  map¬ 
ping  of  the  tensors.  In  particular,  our  decomposition  needs 
all  symmetric  tensor  dimensions  to  be  mapped  with  the  same 
cyclic  phase.  Further,  we  must  satisfy  special  considerations 
for  each  contraction,  that  can  be  defined  in  terms  of  indices 
(we  will  call  them  paired  tensor  dimensions)  which  are  shared 
by  a  pair  of  tensors  in  the  contraction.  These  considerations 
are 

1)  dimensions  which  are  paired  must  be  mapped  with  the 
same  phase 


2)  for  the  paired  tensor  dimensions  which  are  mapped  to  dif¬ 
ferent  dimensions  of  the  processor  grid  (are  mismatched) 

a)  the  mappings  of  two  pairs  of  mismatched  dimensions 
cannot  share  dimensions  of  the  processor  grid 

b)  the  subspace  formed  by  the  mappings  of  the  mis¬ 
matched  paired  dimensions  must  span  all  input  data 

We  want  to  satisfy  these  constraints  for  a  general  case  of 
any  torus  network  of  any  dimension  and  shape,  and  be  able  to 
select  an  optimal  mapping.  The  mapping  framework  of  CTF 
achieves  this. 

A.  Topology-aware  network  mapping 

Any  torus  topology  can  be  folded  into  a  number  of  tori 
of  smaller  dimension.  Depending  on  the  dimensions  of  the 
tensors  and  the  torus  network,  the  tensor  should  be  mapped  to 
some  folding  of  the  network.  Given  a  folding  of  the  network, 
the  optimal  mapping  should  minimize  the  surface  area  of  the 
sub-tensors.  This  mapping  should  have  the  longest  indices 
of  the  largest  tensor  mapped  to  the  longest  processor  grid 
dimensions.  This  implies  a  greedy  index  assignment  algorithm 
can  efficiently  find  the  best  mapping  for  a  given  folded 
network.  Cyclops  Tensor  Framework  defines  all  foldings  for  a 
given  network  and  selects  mappings  onto  them  dynamically. 

Once  a  tensor  is  defined  or  a  tensor  contraction  is  invoked, 
CTF  searches  through  all  topologies  and  selects  the  best 
mapping  which  satisfies  the  constraints.  The  search  through 
mappings  is  done  entirely  in  parallel  among  processors,  then 
the  best  mapping  is  selected  across  all  processors.  The  map¬ 
ping  logic  is  done  without  reading  or  moving  any  of  the  tensor 
data  and  is  generally  composed  of  integer  logic  that  executes 
in  a  trivial  amount  of  time  with  respect  to  the  contraction.  We 
construct  a  ’ghost’  mapping  for  each  valid  topology  and  each 
ordering  of  tensors.  The  distributed  contraction  algorithm  is 
constructed  on  each  ghost  mapping,  and  its  communication 
and  memory  overheads  are  evaluated.  If  the  ghost  mapping 
is  suboptimal  it  is  thrown  out  without  ever  dictating  data 
movement.  Once  a  mapping  is  decided  upon,  the  tensors  are 
redistributed.  Mappings  in  which  tensors  are  replicated  are 
also  considered,  with  corresponding  replication  and  reduction 
kernels  generalizing  2.5D  algorithms. 

The  best  mapping  can  be  selected  according  to  a  perfor¬ 
mance  model.  The  amount  of  virtualization  (to  be  described 
in  the  next  section),  communication,  memory  usage,  and  nec¬ 
essary  redistributions  is  explicitly  calculated  for  each  mapping 
and  the  optimal  one  is  selected.  Generally,  we  attempt  to 
not  exceed  the  memory,  not  perform  more  than  some  factor 
of  virtualization  when  possible,  and  minimize  communication 
given  those  constraints.  However,  other  performance  models 
may  be  used  and  the  best  one  may  depend  on  the  architecture 
and  the  scientific  problem  of  interest. 

B.  Virtualization 

Cyclops  Tensor  Framework  performs  virtualization  to  create 
a  level  of  indirection  between  the  task  decomposition  and  the 
physical  network  topology.  We  provide  a  virtualization  scheme 
that  is  guaranteed  to  generate  a  load  balanced  decomposition 


Fig.  2.  Virtualization  as  used  in  CTF  to  perform  contractions.  This  diagram  demonstrates  a  mapping  for  a  contraction  of  the  form  cjj.;];  =  fT  ayfe;]  ■  • 

In  this  case,  we  have  a  4-by-2  processor  grid,  and  a  4-by-4-by-4  virtual  grid. 


for  any  given  tensor  contraction  (tensors  of  any  symmetry, 
any  dimension,  and  any  index  map  defining  the  contraction). 
Further,  we  parameterize  the  virtual  decomposition  so  that  it 
is  effectively  a  multiple  of  the  processor  grid,  which  insures 
that  each  processor  owns  the  same  number  of  sub-blocks.  This 
scheme  reduces  the  problem  of  mapping  tensors  with  symme¬ 
try  to  mapping  padded  tensors  with  no  symmetry.  For  example, 
in  Figure  2,  the  3D  virtualized  mapping  is  decomposed  among 
the  processors  so  that  each  processor  is  contracting  a  matrix 
of  symmetric  tensors  with  a  vector  of  symmetric  tensors  into 
a  matrix  of  symmetric  tensors.  The  mapping  is  defined  so  that 
by  the  time  the  distributed  contraction  algorithm  is  executed, 
it  need  not  be  aware  of  the  symmetry  of  the  sub-tensors  but 
only  of  their  size. 

We  do  not  use  a  dynamically  scheduled  virtualization  ap¬ 
proach  such  as  the  overdecomposition  embodied  by  the 
Charm-H-  runtime  system  [31].  Instead,  we  define  the  vir¬ 
tualization  so  that  its  dimensions  are  a  multiple  of  the 
physical  torus  dimensions  and  generate  a  regular  mapping. 
This  approach  maintains  perfect  load-balance  and  achieves 
high  communication  and  task  granularity  by  managing  each 
virtualized  sub-grid  explicitly  within  each  processor. 

C.  Redistribution  of  data 

To  satisfy  each  new  set  of  restrictions  for  a  contraction 
the  mapping  must  change  and  tensor  data  must  be  reshuffled 
among  processors  according  to  the  new  mapping.  Since  the 
redistribution  can  potentially  happen  between  every  contrac¬ 
tion,  an  efficient  implementation  is  necessary.  However,  the 
data  must  first  be  given  to  CTF  by  the  user  application.  We 
detail  a  scheme  for  input  and  output  of  data  by  key-value 
pairs,  as  well  as  a  much  more  efficient  algorithm  for  mapping- 
to-mapping  tensor  redistribution.  Since  Coupled  Cluster  and 
most  other  scientific  applications  are  iterative  and  perform 
sequences  of  operations  (contractions)  on  the  same  data,  we 
assume  input  and  output  of  data  will  happen  less  frequently 
than  contractions. 

To  support  general  and  simple  data  entry,  CTF  allows  the 
user  to  write  tensor  data  bulk-synchronously  into  the  tensor 
object  using  key-value  pairs.  This  allows  the  user  to  write 


data  from  any  distribution  that  was  previously  defined,  and 
to  do  so  with  any  desired  granularity  (all  datbe  a  at  once  or 
by  chunks).  Redistribution  happens  by  calculating  the  cyclic 
phase  of  each  key  to  determine  which  processor  it  belongs  on. 
Once  counts  are  assembled  the  data  is  redistributed  via  all-to- 
all  communication.  After  this  single  redistribution  phase,  each 
processor  should  receive  all  data  belonging  to  its  sub-tensors, 
and  can  simply  bin  by  virtual  block  then  sort  it  locally  to 
get  it  into  the  right  order.  This  key-value  binning  scheme  is 
essentially  as  expensive  as  a  parallel  sorting  algorithm. 

When  transitioning  between  distributions,  which  we  expect 
to  happen  much  more  frequently  than  between  the  application 
and  user,  we  can  take  advantage  of  existing  knowledge  about 
the  distribution.  Between  each  distribution  the  cyclic  phase 
along  each  dimension  of  the  tensor  can  potentially  change. 
This  implies  that  each  element  might  migrate  from  any  given 
processor  to  another.  Further,  depending  on  the  cyclic  phase, 
the  amount  of  padding  could  change  along  each  dimension 
of  the  tensor.  Additionally,  due  to  the  blocking  schemes 
employed  within  each  processor  (to  be  described  in  the  next 
section),  the  local  ordering  of  the  elements  on  each  processor 
can  change. 

Our  solution  is  to  communicate  the  data  while  preserving 
the  global  ordering  of  elements  in  the  communicated  buffer. 
Each  processor  iterates  over  its  local  data  in  the  order  of 
the  global  index  of  the  data,  and  computes  the  destination 
processor.  The  reverse  process  is  performed  in  order  for  each 
processor  to  determine  what  data  is  received  from  which 
processor. 

Redistributions  require  communication,  however,  they  are 
a  lower-order  communication  term  with  respect  to  a  tensor 
contraction.  Each  piece  of  data  must  be  migrated  only  once, 
and  a  trade-off  between  latency  and  bandwidth  is  made  by  the 
selection  of  an  all-to-all  algorithm. 

VII.  Parallel  performance 
A.  Implementation  details 

The  implementation  uses  no  external  libraries  except  for 
MPI  [32],  BLAS,  and  OpenMP.  All  code  is  tightly  integrated 
and  written  in  C/C++.  Computationally  expensive  routines  are 


BG/Q  matrix  multiplication 


CCSD  weak  scaling  on  Mira  (BG/Q) 


Fig.  3.  Figure  3(a)  displays  the  strong  scaling  of  matrix  multiplication  of  32K-by-32K  square  matrices.  Figure  3(b)  shows  weak  scaling  of  CCSD  on  Blue 
Gene/Q.  The  number  of  occupied  orbitals  ranged  from  100  to  250  and  the  number  of  virtual  orbitals  ranged  from  400  to  1000. 


threaded  and/or  parallelized  with  MPI.  Performance  profiling 
is  done  by  hand  and  with  TAU  [33], 

B.  Architectures 

Cyclops  Tensor  Framework  targets  massively  parallel  archi¬ 
tectures  and  is  designed  to  take  advantage  of  network  topolo¬ 
gies  and  communication  infrastructure  that  scale  to  millions 
of  nodes.  Parallel  scalability  on  commodity  clusters  should 
benefit  significantly  from  the  load  balanced  characteristics  of 
the  workload,  while  high-end  supercomputers  will  additionally 
benefit  from  reduced  inter-processor  communication  which 
typically  becomes  a  bottleneck  only  at  very  high  degrees  of 
parallelism.  We  collected  performance  results  on  two  state- 
of-the-art  supercomputer  architectures,  IBM  Blue  Gene/Q  and 
Cray  XE6.  We  also  tested  sequential  and  multi-threaded  per¬ 
formance  on  a  Xeon  desktop. 

The  sequential  and  non-parallel  multi-threaded  performance 
of  CTF  is  compared  to  NWChem  and  MRCC.  The  platform  is 
a  commodity  dual-socket  quad-core  Xeon  E5620  system.  On 
this  machine,  we  used  the  sequential  and  threaded  routines  of 
the  Intel  Math  Kernel  Library.  This  platform,  as  well  as  the 
problem  sizes  tested  reflect  a  typical  situation  for  workloads  on 
a  workstation  or  small  cluster,  which  is  where  the  sequential 
performance  of  these  codes  is  most  important.  Three  problem 
sizes  are  timed,  spanning  a  variety  of  ratios  of  the  number  of 
virtual  orbitals  to  occupied  orbitals. 

The  second  experimental  platform  is  ‘Hopper’,  which  is 
a  Cray  XE6  supercomputer,  built  from  dual-socket  12-core 
“Magny-Cours”  Opteron  compute  nodes.  We  used  the  Cray 
LibSci  BLAS  routines.  This  machine  is  located  at  the  NERSC 
supercomputing  facility.  Each  node  can  be  viewed  as  a  four- 
chip  compute  configuration  due  to  NUMA  domains.  Each 
of  these  four  chips  have  six  super-scalar,  out-of-order  cores 
running  at  2.1  GHz  with  private  64  KB  LI  and  512  KB 
L2  caches.  Nodes  are  connected  through  Cray’s  ‘Gemini’ 
network,  which  has  a  3D  torus  topology.  Each  Gemini  chip, 
which  is  shared  by  two  Hopper  nodes,  is  capable  of  9.8  GB/s 


bandwidth.  However,  the  NERSC  Cray  scheduler  does  not 
allocate  contiguous  partitions,  so  topology-aware  mapping 
onto  a  torus  cannot  currently  be  performed. 

The  final  platform  we  consider  is  the  IBM  Blue  Gene/Q 
(BG/Q)  architecture.  We  use  the  installations  at  Argonne  and 
Lawrence  Livermore  National  Laboratories.  On  both  installa¬ 
tions,  IBM  ESSL  was  used  for  BLAS  routines.  BG/Q  has  a 
number  of  novel  features,  including  a  5D  torus  interconnect 
and  16-core  SMP  processor  with  4-way  hardware  multi¬ 
threading,  transactional  memory  and  L2-mediated  atomic  op¬ 
erations  [34],  all  of  which  serve  to  enable  high  performance 
of  the  widely  portable  MPI/OpenMP  programming  model.  The 
BG/Q  cores  run  at  1 .6  GHz  and  the  QPX  vector  unit  supports 
4-way  fused  multiply-add  for  a  single-node  theoretical  peak 
of  204.8  GF/s.  The  BG/Q  torus  interconnect  provides  2  GB/s 
of  theoretical  peak  bandwidth  per  link  in  each  direction,  with 
simultaneous  communication  along  all  10  links  achieving  35.4 
GB/s  for  1  MB  messages  [35], 

C.  Results 

We  present  the  performance  of  a  CCSD  implementation  on 
top  of  Cyclops  Tensor  Framework.  The  CCSD  contraction 
code  was  extended  from  CCD  in  a  few  hours  of  work  and 
is  very  compact.  For  each  contraction,  written  in  one  line 
of  code,  CTF  finds  a  topology-aware  mapping  of  the  tensors 
to  the  computer  network  and  performs  the  necessary  set  of 
contractions  on  the  packed  structured  tensors. 

1 )  Sequential  performance:  The  results  of  the  sequential 
and  multi-threaded  comparison  are  summarized  in  Table  I.  The 
time  per  CCSD  iteration  is  lowest  for  NWChem  in  all  cases, 
and  similarly  highest  for  MRCC.  The  excessive  iteration  times 
for  MRCC  when  the  —  ratio  becomes  small  reflect  the  fact 
that  MRCC  is  largely  memory-bound,  as  contractions  are  per¬ 
formed  only  with  matrix-vector  products.  The  multi-threaded 
speedup  of  CTF  is  significantly  better  than  NWChem,  most 
likely  due  to  the  lack  of  multi-threading  of  tensor  transposition 
and  other  non-contraction  operations  in  NWChem. 


TABLE  I 

Sequential  and  non-parallel  multi-threaded  performance 
COMPARISON  OF  CTF,  NWCHEM,  AND  MRCC.  ENTRIES  ARE  AVERAGE 
TIME  FOR  ONE  CCSD  ITERATION,  FOR  THE  GIVEN  NUMBER  OF  VIRTUAL 
(n„)  AND  OCCUPIED  (ra0)  ORBITALS. 


nv  =  110 
n0  =  5 

nv  =  94 
n0  =  11 

nv  =  71 
n0  =  23 

NWChem 

CTF 

MRCC 

1  thread 

1  thread 

1  thread 

6.80  sec 
23.6  sec 
31.0  sec 

16.8  sec 
32.5  sec 
66.2  sec 

49.1  sec 
59.8  sec 
224.  sec 

NWChem 

CTF 

MRCC 

8  threads 

8  threads 

8  threads 

5.21  sec 
9.12  sec 
67.3  sec 

8.60  sec 
9.37  sec 
64.3  sec 

18.1  sec 

18.5  sec 

86.6  sec 

TABLE  II 

CCSD  ITERATION  TIME  ON  64  NODES  OF  HOPPER  FOR  nv  VIRTUAL 
ORBITALS  AND  n0  OCCUPIED  ORIBTALS: 


system 

n0 

nv 

CTF 

NWChem 

w5 

25 

180 

14  sec 

36  sec 

w7 

35 

252 

90  sec 

178  sec 

w9 

45 

324 

127  sec 

- 

w!2 

60 

432 

336  sec 

- 

2)  Performance  scalability:  On  the  Cray  XE6  machine,  we 
compared  the  performance  of  our  CCSD  implementation  with 
that  of  NWChem.  We  benchmarked  the  two  codes  for  a  series 
of  water  systems.  In  Table  II,  we  detail  the  best  time  to  solution 
achieved  for  each  water  problem  by  NWChem  and  CTF  on  64 
nodes  of  Hopper.  The  execution  of  NWChem  was  terminated 
if  it  did  not  complete  a  CCSD  iteration  by  half  an  hour  of 
execution,  which  is  denoted  by  a  dash  in  the  table.  NWChem 
completed  CCSD  for  the  w9  water  system  on  128  nodes,  at 
the  rate  of  223  sec/iteration.  On  128  nodes,  CTF  performed 
this  task  in  73  sec/iteration,  3-times  faster  than  NWChem. 

As  a  simple  benchmark  of  the  mapping  framework,  we 
compare  the  performance  of  matrix  multiplication  (which 
is  a  tensor  contraction),  done  by  CTF  with  the  distributed 
matrix  multiplication  performance  of  ScaFAPACK  [36]  on 
Blue  Gene/Q.  CTF  performs  topology-aware  mapping  on  the 
architecture  and  employs  optimized  collective  communication. 
As  a  result  (Figure  3(a))  CTF  achieves  significantly  better 
strong  scalability  and  achieves  one  petaflop/s  on  16,384  nodes 
(262K  cores). 

The  parallel  weak  scaling  efficiency  of  our  CCSD  imple¬ 
mentation  on  Blue  Gene/Q  is  displayed  in  Figure  3(b).  This 
weak  scaling  data  was  collected  by  doing  the  largest  CCSD  run 
that  would  fit  in  memory  on  each  node  count  and  normalizing 
the  efficiency  by  the  operation  count.  Going  from  512  to 
8192  nodes  (130K  cores),  the  efficiency  actually  increases, 
since  larger  CCSD  problems  can  be  done,  which  increases 
the  ratio  of  computation  over  communication.  We  maintain 
high  efficiency  (30%  of  theoretical  peak)  to  8,192  nodes  of 
BG/Q.  The  application  was  run  with  4  MPI  processes  per 
node  and  16  threads  per  process.  Results  at  higher  scales  are 
expected  to  improve  by  reducing  the  number  of  MPI  ranks  and 
running  with  one  process  per  node.  However,  this  will  require 
running  with  32-64  threads,  a  challenge  for  index  transposition 
and  redistribution  kernels.  Investigation  of  better  blocking  and 


TABLE  in 

A  PERFORMANCE  BREAKDOWN  OF  IMPORTANT  KERNELS  FOR  A  CCSD 
ITERATION  DONE  BY  CTF  ON  A  SYSTEM  WITH  na  =  200  OCCUPIED 
ORIBTALS  AND  nv  =  800  VIRTUAL  ORBITALS  ON  4096  NODES  (65K 
cores)  of  Mira. 


kernel 

%  of  time 

complexity 

architectural  bounds 

matrix  mult. 

45% 

O(n*n'20/p) 

flops/mem  bandwidth 

broadcasts 

20% 

0{n*n20/p\jM) 

multicast  bandwidth 

prefix  sum 

10% 

o(P) 

allreduce  bandwidth 

data  packing 

7% 

0(n'in'i/p) 

integer  ops 

all-to-all-v 

7% 

0{nini/p) 

bisection  bandwidth 

tensor  folding 

4% 

0(nin'i/p) 

memory  bandwidth 

threading  schemes  for  transposition  kernels  will  be  necessary. 

Table  III  lists  profiling  data  for  a  run  of  CTF  on  4096  nodes 
(65K  cores)  of  BG/Q.  Nearly  half  the  execution  is  spent  in 
matrix  multiplication,  showing  the  relatively  high  efficiency 
of  this  calculation  (24%  of  theoretical  floating  point  peak). 
The  prefix  sum,  data  packing,  and  all-to-all-v  operations  are 
all  part  of  tensor  redistribution,  which  has  a  large  effect  on 
performance.  The  table  lists  the  architectural  bounds  for  each 
kernel,  demonstrating  that  the  application  is  stressing  many 
components  of  the  hardware  with  significant  computations. 

VIII.  Future  work 

Different  types  of  sparsity  in  tensors  will  also  be  considered 
in  Cyclops  Tensor  Framework.  Tensors  with  banded  sparsity 
structure  can  be  decomposed  cyclically  so  as  to  preserve 
band  structure  in  the  same  way  CTF  preserves  symmetry. 
Completely  unstructured  tensors  can  also  be  decomposed 
cyclically,  though  the  decomposition  would  need  to  perform 
load  balancing  in  the  mapping  and  execution  logic. 

Cyclops  Tensor  Framework  will  also  be  integrated  with  a 
higher-level  tensor  manipulation  framework  as  well  as  CC 
code  generation  methods.  We  have  shown  a  working  imple¬ 
mentation  of  CCSD  on  top  of  CTF,  but  aim  to  implement 
much  more  complex  methods.  In  particular,  we  are  targeting 
the  CCSDTQ  method,  which  employs  tensors  of  dimension  up 
to  8  and  gets  the  highest  accuracy  of  any  desirable  CC  method 
(excitations  past  quadruples  have  a  negligible  contribution). 
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